import gzip
import os
import pandas as pd
import numpy as np
# For visuals:
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
#%pip install python-decouple
from decouple import config
API_USERNAME = config('USER')
API_KEY = config('PLOTLY_API_KEY')
import chart_studio
chart_studio.tools.set_credentials_file(username=API_USERNAME, api_key=API_KEY)
import chart_studio.plotly as py
import plotly.offline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.go_offline()
# Configure cufflinks
cf.set_config_file(offline=False, world_readable=True, theme='pearl')
The dataset chosen for this project belongs to the Amazon Product Data (McAuley, 2014). The complete dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014. Smaller datasets are available for class projects; the chosen dataset, which concerns book reviews, is one of them. These smaller datasets were generated by extracting the 5-core, such that each of the remaining users and items has at least 5 reviews. In graph theory, the k-core of a graph is the maximal subgraph in which every vertex has degree at least k: that is, every vertex in the subgraph is connected to at least k other vertices. An initial description of the book-review dataset appears below.
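The 5-core property can be checked with a couple of groupby operations. The sketch below uses a toy dataframe with hypothetical reviewer and book IDs standing in for the loaded reviews:

```python
import pandas as pd

# Toy frame standing in for the loaded reviews (hypothetical IDs).
reviews = pd.DataFrame({
    'reviewerID': ['u1'] * 5 + ['u2'] * 5,
    'asin': ['b1', 'b2', 'b1', 'b2', 'b1',
             'b2', 'b1', 'b2', 'b1', 'b2'],
})

k = 5
# In a k-core, every remaining user and item has at least k reviews.
min_reviews_per_user = reviews.groupby('reviewerID').size().min()
min_reviews_per_book = reviews.groupby('asin').size().min()
is_k_core = (min_reviews_per_user >= k) and (min_reviews_per_book >= k)
print(is_k_core)  # True for this toy frame
```

Running the same two groupbys on the full dataframe confirms whether the extraction kept the 5-core intact.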
The functions below are provided directly on the Amazon Review Data page by the author; they are used to load the 5-core book reviews as a pandas DataFrame.
def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
df = getDF('../data/raw/reviews_Books_5.json.gz')
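Calling eval on each line is risky on untrusted input. The 2014 release reportedly stores Python dict literals (single-quoted), for which ast.literal_eval is a safe drop-in; when the lines are strict JSON, json.loads or pandas' line-delimited reader works directly. A self-contained sketch, using a tiny generated sample file (the path is hypothetical):

```python
import gzip
import json
import pandas as pd

# Write a tiny gzipped, line-delimited JSON sample to demonstrate the loaders.
sample_path = 'sample_reviews.json.gz'  # hypothetical path for this sketch
with gzip.open(sample_path, 'wt') as g:
    g.write(json.dumps({'reviewerID': 'u1', 'asin': 'b1', 'overall': 5.0}) + '\n')
    g.write(json.dumps({'reviewerID': 'u2', 'asin': 'b1', 'overall': 4.0}) + '\n')

# json.loads replaces eval line by line (use ast.literal_eval for dict-literal lines).
with gzip.open(sample_path, 'rb') as g:
    records = [json.loads(line) for line in g]
df_safe = pd.DataFrame(records)

# pandas can also do it in one call: lines=True handles one JSON object per line.
df_oneliner = pd.read_json(sample_path, lines=True, compression='gzip')
```

Building the frame from a list of records is also considerably faster than inserting rows into a dict one at a time.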
I used the snippet below to monitor the memory requirements of the loading step.
# %pip install memory_profiler
# %load_ext memory_profiler
# %memit
Below you can see the fields loaded and a count of the values per field:
df.count()
A sample of the overall data appears next:
df[0:10]
In general, the loaded dataframe includes 9 fields:

reviewerID: A string (probably a hash) that uniquely identifies the user who submitted the review.
asin: ASIN stands for Amazon Standard Identification Number. Almost every product on Amazon has its own ASIN, a unique code used to identify it. For books, the ASIN is the same as the book's ISBN.
reviewerName: The name of the reviewer.
helpful: Amazon has implemented an interface that allows customers to vote on whether a particular review has been helpful or unhelpful. This field captures those votes as a rating of the review, e.g. [2,3] --> 2/3 (2 of 3 voters found it helpful).
reviewText: The actual review provided by the reviewer.
overall: The product's rating attributed by the same reviewer.
summary: A summary of the review.
unixReviewTime: Time of the review (unix time).
reviewTime: Time of the review (raw).

Of these fields, for the purposes of this project we keep reviewerID, asin, reviewText, overall and helpful. Specifically, we keep reviewerID only to merge it with asin and create a unique identifier (key) per review, e.g.:
key = reviewerID:"A10000012B7CGYKOMPQ4L" + asin:"000100039X"
asin is obviously necessary to identify the distinct books in the dataset, while the rest are necessary for the analysis (overall, reviewText) and for evaluation (helpful) purposes.
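The key construction described above can be sketched as follows; the first row reuses the example values from the text, the second is hypothetical:

```python
import pandas as pd

# Toy rows; real values come from the loaded dataframe.
df_keys = pd.DataFrame({
    'reviewerID': ['A10000012B7CGYKOMPQ4L', 'A10000012B7CGYKOMPQ4L'],
    'asin': ['000100039X', '0001048791'],
})

# Concatenate the two identifiers into one unique review key.
df_keys['key'] = df_keys['reviewerID'] + ':' + df_keys['asin']
print(df_keys['key'].iloc[0])  # A10000012B7CGYKOMPQ4L:000100039X
```

Because the 5-core keeps at most one review per (user, book) pair, this concatenation is unique per review.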
The inspection begins with the distribution of review ratings, which appears in the figure below. As is evident, close to half of the book reviews are rated 5 stars, a quarter of them 4 stars, and the remaining quarter is distributed amongst the 3-, 2- and 1-star ratings. To get a clearer idea of how books are rated overall, rather than looking at the distribution of ratings across all reviews, the distribution of average book ratings was also generated and appears in Figure 5. The chart was generated with 100 bins. The distribution, heavily skewed towards the right, confirms that the majority of books are rated between 4 and 5 stars. As such, one should expect this to be reflected in the text that accompanies the respective reviews, which means that negative reviews will be harder to find.
# Number of reviews:
number_of_reviews=len(df)
my_number_string = '{:0,.0f}'.format(number_of_reviews)
print('Number of Reviews: ' + my_number_string + '.')
# Unique number of items:
unique_books=len(df['asin'].unique())
my_number_string = '{:0,.0f}'.format(unique_books)
print('Number of Books: ' + my_number_string + '.')
# Distribution of Ratings (too many to plot with plotly)
fig = df['overall'].plot.hist(alpha=0.5, title='Ratings Distribution', figsize=(15,7), grid=True)
fig.set_xlabel("Ratings")
fig.set_ylabel("Number of Reviews")
df10 = df[['overall','asin']]
df11 = pd.DataFrame(df10.groupby(['asin'])['overall'].mean())
len(df11)
df11 = df11.reset_index()
df11.head()
#df11['overall'].iplot(kind='histogram', bins=100, xTitle='Rating (0-5)',yTitle='Number of Books', title='Average Book Ratings')
df11.plot.hist(alpha=0.5,bins=100)
df20 = df[['asin','reviewTime']].copy()  # copy to avoid SettingWithCopyWarning

def get_year(reviewTime):
    # reviewTime looks like '07 7, 2014'; the year follows the comma.
    day_month_year_list = reviewTime.split(',')
    if len(day_month_year_list) == 2:
        return day_month_year_list[1]
    else:
        return 0  # fallback when the year cannot be parsed

df20['reviewYear'] = df20['reviewTime'].apply(get_year)
df20.head()
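Splitting the reviewTime string is fragile; since unixReviewTime is also available, the year can equivalently be derived from the timestamp. A sketch on toy data (timestamps chosen to fall in 1996 and 2014):

```python
import pandas as pd

# Toy timestamps; unixReviewTime holds seconds since the Unix epoch.
toy = pd.DataFrame({'unixReviewTime': [833414400, 1404691200]})
toy['reviewYear'] = pd.to_datetime(toy['unixReviewTime'], unit='s').dt.year
print(toy['reviewYear'].tolist())  # [1996, 2014]
```

This also returns an integer year directly, whereas the string split yields a year with a leading space that still needs trimming before grouping.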
books_per_year = pd.DataFrame(df20.groupby(['reviewYear']).size())
books_per_year.columns = ['counts']
# The notebook was relaunched from the shell with a higher data-rate limit so plotly can render the chart:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
books_per_year.iplot(kind='bar', xTitle='Years', yTitle='Number of Reviews', title='Number of Reviews per Year')
df30 = df[['asin','reviewTime', 'overall']].copy()  # copy to avoid SettingWithCopyWarning
df30['reviewYear'] = df30['reviewTime'].apply(get_year)
df30.head()
books_per_rating_per_year = df30.groupby(['reviewYear','overall']).size().reset_index(name='counts')
books_per_rating_per_year[0:10]
pivot_df = books_per_rating_per_year.pivot(index='reviewYear', columns='overall', values='counts')
pivot_df.iplot(kind='bar', barmode='stack', xTitle='Years', yTitle='Number of Reviews', title='Number of Reviews per Rating per Year')
df40 = df[['asin', 'helpful']]
# Create a new column for the numerator (helpful votes)
df40 = df40.assign(enum = df40['helpful'].apply(lambda enum_denom: enum_denom[0]))
# Create a new column for the denominator (total votes)
df40 = df40.assign(denom = df40['helpful'].apply(lambda enum_denom: enum_denom[1]))
# Filter on the denom
df40 = df40.loc[df40['denom'] != 0]
df40[0:15]
len(df40)
bin_values = np.arange(start=0,stop=100,step=1)
df40['denom'].plot.hist(alpha=0.5, bins=bin_values, figsize=(15,7), grid=True, title='Distribution of Binary Helpfulness Ratings Counts per Review')
# Focus on the (15, 100) range of votes per review
df40 = df40.loc[df40['denom'] > 15]
df40 = df40.loc[df40['denom'] < 100]
len(df40)
df50 = df40.assign(percentage = df40['enum']/df40['denom'])
df50['percentage'].iplot(kind='histogram', title='Distribution of Helpfulness Percentage')
df50.head()
threshold = 0.7
df60 = df50.loc[df50['percentage'] > threshold]
len(df60)
# END OF FILE